A Semantic Approach to Internet Tabular Information Extraction

نویسندگان

  • Shih-Hung Wu
  • Huei-Long Wang
  • I-Chi Wang
  • Cheng-Lung Sung
  • Wei-Kuan Shih
  • Wen-Lian Hsu
چکیده

Extracting information from tables is essential for Internet information extraction. However, most web tables are designed in HTML format. To decipher their semantic meanings a system needs to deal with various layouts, which is quite cumbersome. Previous works have two major approaches: layout enumeration approach and wrapper approach. The first approach is to match the table with presorted layout. This approach normally does not perform appropriate information integration since it does not use table semantics. The second approach treats different tables in a case-by-case manner laboriously. We present a semantic approach to automatically recognize tables (in specified knowledge domains) with various layouts. Our system first tags table cells using domain knowledge, solve the multiple tagging ambiguities, and then apply layout-syntax rules to transform tables into database format. Experimental results show that the precision and recall are higher than 93% and 82% respectively in four different domains. Our approach has high precision and is suitable as a front end for wrapper construction.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model

Information extraction (IE) is a process of automatically providing a structured representation from an unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) which has been intensified by the increased volume of information and heterogeneity, and non-structured form of it. One of the core information extraction tasks is relation extraction wh...

متن کامل

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

متن کامل

Reverse Engineering of Network Software Binary Codes for Identification of Syntax and Semantics of Protocol Messages

Reverse engineering of network applications especially from the security point of view is of high importance and interest. Many network applications use proprietary protocols which specifications are not publicly available. Reverse engineering of such applications could provide us with vital information to understand their embedded unknown protocols. This could facilitate many tasks including d...

متن کامل

Crowd-Sourcing the Large-Scale Semantic Mapping of Tabular Data

Governments and public administrations started recently to publish large amounts of structured data on the Web, mostly in the form of tabular data such as CSV files or Excel sheets. Various tools and projects have been launched aiming at facilitating the lifting of tabular data to reach semantically structured and linked data. However, none of these tools supported a truly incremental, pay-as-y...

متن کامل

Automatic Extraction of Information from the Web

The semantic Web will bring meaning to the Internet, making it possible for web agents to understand the information it contains. However, current trends seem to suggest that the semantic web is not likely to be adopted in the forthcoming years. In this sense, meaningful information extraction from the web becomes a handicap for web agents. In this article, we present a framework for automatic ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006